Decoding Consumer Sentiment: Analyzing Trends in Yelp Reviews

Consumers use review data to decide where to eat and what to buy; businesses use it to gather feedback and benchmark themselves against competitors. To act on this data accurately, it helps to understand the trends underlying it. I analyzed the sentiment of roughly 50,000 Yelp reviews spanning 2005 to 2022. Some findings were expected: reviews with more positive language tended to carry higher star ratings. Others were less intuitive: reviews with lower star ratings tended to be nearly neutral in tone, with relatively little variability, while high-starred reviews were more polarized but also more variable. Finally, I selected a few business categories and explored how their star ratings and tone compared to each other and changed over time.

In [159]:
# # Extract data from tarfile
# with tarfile.open("yelp_dataset.tgz", "r") as all_data:
#     # Extract all members of the archive
#     all_data.extractall(filter="tar")

# # Open and read the review JSON file

# review_rows = []

# with open("yelp_academic_dataset_review.json", "r", encoding="utf-8") as file:
#     for line in file:
#         # Load each line as a separate JSON object (row)
#         row = json.loads(line)
#         review_rows.append(row)

# # Convert the list of rows into a pandas DataFrame
# review = pd.DataFrame(review_rows)

# # Open and read the business JSON file

# business_rows = []

# with open(
#     "yelp_academic_dataset_business.json", "r", encoding="utf-8"
# ) as file:
#     for line in file:
#         # Load each line as a separate JSON object (row)
#         row = json.loads(line)
#         business_rows.append(row)

# # Convert the list of rows into a pandas DataFrame
# business = pd.DataFrame(business_rows)
# # Take a sample of the businesses
# business_sample = business.sample(n=1000, random_state=1)

# # Combine the two datasets

# combined = business_sample.merge(
#     review, how="inner", on="business_id", suffixes=("_business", "_review")
# )

# # Convert the combination into a csv

# combined.to_csv("yelp_combined.csv", index=False)

The full Yelp dataset is large enough that even simple operations ran unreasonably slowly. I therefore took a random sample of 1,000 businesses from the business dataset and inner-joined it with the review dataset, giving me every review for each sampled business.
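The sampling-and-join step can be illustrated on a toy pair of tables; the column names mirror the Yelp schema, but the values here are made up:

```python
import pandas as pd

# Toy stand-ins for the Yelp business and review tables
business = pd.DataFrame(
    {"business_id": ["a", "b", "c", "d"], "stars": [4.5, 3.0, 5.0, 2.5]}
)
review = pd.DataFrame(
    {
        "business_id": ["a", "a", "b", "d", "e"],
        "stars": [5, 4, 3, 2, 1],
        "text": ["great", "good", "okay", "meh", "bad"],
    }
)

# Sample businesses, then keep only the reviews of sampled businesses;
# the suffixes disambiguate the overlapping "stars" column
business_sample = business.sample(n=2, random_state=1)
combined = business_sample.merge(
    review, how="inner", on="business_id", suffixes=("_business", "_review")
)
print(combined.shape)
```

An inner join drops sampled businesses that have no reviews, which is the behavior wanted here.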

In [ ]:
# Import libraries
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import display
from textblob import TextBlob
import ipywidgets as widgets
import scipy.stats as stats
import scikit_posthocs as sp
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import json
from itertools import combinations
import statsmodels.stats.multitest as smm

%matplotlib inline

pio.templates.default = "plotly_white"

I then did some feature engineering: a new variable for the length of each review, and, using the TextBlob library, the polarity (how negative, -1, or positive, +1, the text is) and subjectivity (how factual, 0, or opinionated, 1, it is) of each review. I also created Abs_polarity, the absolute value of the polarity, which measures the strength of the tone regardless of direction.
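The per-review features can be sketched end to end. TextBlob does the real scoring in the cell below; the `toy_polarity` lexicon here is purely hypothetical, standing in for it so the pipeline shape is reproducible without the TextBlob corpora:

```python
import pandas as pd

# Hypothetical stand-in for TextBlob's polarity score (the real analysis
# uses TextBlob; this toy lexicon only illustrates the pipeline shape)
LEXICON = {"great": 0.8, "good": 0.5, "bad": -0.7, "awful": -1.0}

def toy_polarity(text: str) -> float:
    words = text.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

df = pd.DataFrame({"text": ["great food", "awful service", "it was fine"]})
df["Review Length"] = df["text"].apply(len)       # characters, not words
df["Polarity"] = df["text"].apply(toy_polarity)
df["Abs_polarity"] = df["Polarity"].abs()
```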

In [161]:
# Load the preprocessed combined dataset

combined = pd.read_csv("yelp_combined.csv", encoding="utf-8")
In [162]:
# # Calculating the length of each review
# combined["Review Length"] = combined["text"].apply(len)

# # Loop through each row and calculate sentiments
# for index, text in combined["text"].items():
#     blob = TextBlob(text)
#     combined.at[index, "Polarity"] = blob.sentiment.polarity
#     combined.at[index, "Subjectivity"] = blob.sentiment.subjectivity

# # Calculate polarity strength
# combined["Abs_polarity"] = abs(combined["Polarity"])

Below are the first five rows of the dataset as well as some preliminary statistics. I also checked for null values. The only variables that had null values were address, attributes, and hours, none of which I use in this analysis.

In [163]:
# Display the first five rows

display(combined.head())

# Show some preliminary statistics

display(combined.describe())

# Check for null values

print(combined.isnull().sum())
business_id name address city state postal_code latitude longitude stars_business review_count ... funny cool text date Review Length Star Sentiment Polarity Subjectivity Abs_polarity Sentiment
0 L_f14MSPdkgHI81mN9--bw Luca Italian Leather 100 2nd Ave NE Saint Petersburg FL 33701 27.7733 -82.633806 5.0 5 ... 1 1 This neat store opened not too long ago. They... 2016-03-03 01:25:38 629 Positive 0.347708 0.678889 0.347708 Positive
1 L_f14MSPdkgHI81mN9--bw Luca Italian Leather 100 2nd Ave NE Saint Petersburg FL 33701 27.7733 -82.633806 5.0 5 ... 0 0 Phenomenal quality and service. Willing to acc... 2021-03-17 22:47:11 85 Positive 0.609375 0.716667 0.609375 Positive
2 L_f14MSPdkgHI81mN9--bw Luca Italian Leather 100 2nd Ave NE Saint Petersburg FL 33701 27.7733 -82.633806 5.0 5 ... 0 0 Great store, unique high quality Italian leath... 2020-02-13 18:25:08 233 Positive 0.390000 0.578750 0.390000 Positive
3 L_f14MSPdkgHI81mN9--bw Luca Italian Leather 100 2nd Ave NE Saint Petersburg FL 33701 27.7733 -82.633806 5.0 5 ... 0 0 My husbanded needed shoes four our wedding. Sa... 2021-02-14 18:53:03 312 Positive 0.215260 0.558157 0.215260 Positive
4 L_f14MSPdkgHI81mN9--bw Luca Italian Leather 100 2nd Ave NE Saint Petersburg FL 33701 27.7733 -82.633806 5.0 5 ... 0 0 Gorgeous leather handbags, jackets and shoes. ... 2021-03-21 18:32:32 155 Positive 0.700000 0.950000 0.700000 Positive

5 rows × 28 columns

latitude longitude stars_business review_count is_open stars_review useful funny cool Review Length Polarity Subjectivity Abs_polarity
count 50363.000000 50363.000000 50363.000000 50363.00000 50363.000000 50363.000000 50363.000000 50363.000000 50363.000000 50363.000000 50363.000000 50363.000000 50363.000000
mean 35.559602 -90.478860 3.786351 277.43657 0.817823 3.785477 1.172508 0.325139 0.506106 570.645573 0.248511 0.564448 0.288057
std 5.491401 15.121928 0.751343 305.53432 0.385994 1.465665 2.812676 1.363873 2.050777 532.092638 0.237985 0.134091 0.188194
min 27.688229 -119.887051 1.000000 5.00000 0.000000 1.000000 0.000000 0.000000 0.000000 17.000000 -1.000000 0.000000 0.000000
25% 29.938914 -90.337143 3.500000 48.00000 1.000000 3.000000 0.000000 0.000000 0.000000 230.000000 0.109670 0.484946 0.143234
50% 36.325741 -86.188308 4.000000 152.00000 1.000000 4.000000 0.000000 0.000000 0.000000 409.000000 0.254762 0.562143 0.266667
75% 39.910601 -82.287938 4.500000 381.00000 1.000000 5.000000 1.000000 0.000000 0.000000 721.000000 0.394643 0.644444 0.400000
max 53.631919 -74.700195 5.000000 1291.00000 1.000000 5.000000 132.000000 77.000000 131.000000 5000.000000 1.000000 1.000000 1.000000
business_id          0
name                 0
address           1027
city                 0
state                0
postal_code          0
latitude             0
longitude            0
stars_business       0
review_count         0
is_open              0
attributes        1033
categories           0
hours             2667
review_id            0
user_id              0
stars_review         0
useful               0
funny                0
cool                 0
text                 0
date                 0
Review Length        0
Star Sentiment       0
Polarity             0
Subjectivity         0
Abs_polarity         0
Sentiment            0
dtype: int64

To get familiar with the variables, I first plotted their distributions with bar graphs and histograms. The most common review rating is 5 stars, followed by 4 and then, closely, 1; ratings of 2 and 3 are noticeably rarer. As one might expect, the distribution of a business's average star rating looks more normal: it is centered at 4, with more extreme values less common. Since 5 is the highest possible rating, the distribution is not symmetric, as the right tail is cut off.

In [164]:
# Bar Plot of the distribution of stars
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
sns.set_style("white")

sns.countplot(data=combined, x="stars_review", color="darkviolet", ax=axes[0])
axes[0].set_title("Distribution of Review Star Ratings")
axes[0].set_xlabel("Review Star Rating")
axes[0].set_ylabel("Count")
axes[0].grid(True, axis="y")

sns.countplot(
    data=combined, x="stars_business", color="darkviolet", ax=axes[1]
)
axes[1].set_title("Distribution of Business Star Ratings")
axes[1].set_xlabel("Business Star Rating")
axes[1].set_ylabel("Count")
axes[1].grid(True, axis="y")

plt.tight_layout()
plt.show()

The distributions of review polarity and subjectivity also look approximately normal. Polarity is fairly symmetric and centered around 0.25; subjectivity is fairly symmetric and centered around 0.55. Both centers sit slightly above the midpoints of their ranges: polarity trends slightly positive, and subjectivity trends slightly more opinionated than factual. Absolute polarity behaves differently. Its most common value is around 0.25, but the distribution is clearly right-skewed, with low polarity strengths far more common than high ones. In other words, reviews tended toward balanced language rather than strictly positive or negative wording.
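One way to see why absolute polarity can be right-skewed while polarity itself looks symmetric: taking the absolute value folds the negative tail of a distribution centered slightly above zero back onto small positive values. A quick simulation (the center 0.25 and spread 0.24 are taken from the summary statistics above; everything else is synthetic):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Symmetric "polarity" centered slightly above zero, clipped to [-1, 1]
polarity = np.clip(rng.normal(loc=0.25, scale=0.24, size=50_000), -1, 1)
abs_polarity = np.abs(polarity)

# Folding the negative tail produces a clear right skew
print(skew(polarity), skew(abs_polarity))
```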

In [165]:
# Histogram of the distribution of Polarity and Subjectivity

fig, axes = plt.subplots(1, 3, figsize=(21, 7))
sns.set_style("white")

sns.histplot(
    data=combined, x="Polarity", color="darkviolet", kde=True, ax=axes[0]
)
axes[0].set_title("Distribution of Review Polarity")
axes[0].set_xlabel("Review Polarity")
axes[0].set_ylabel("Count")
axes[0].grid(True, axis="y")

sns.histplot(
    data=combined, x="Subjectivity", color="darkviolet", kde=True, ax=axes[1]
)
axes[1].set_title("Distribution of Review Subjectivity")
axes[1].set_xlabel("Review Subjectivity")
axes[1].set_ylabel("Count")
axes[1].grid(True, axis="y")

sns.histplot(
    data=combined, x="Abs_polarity", color="darkviolet", kde=True, ax=axes[2]
)
axes[2].set_title("Distribution of Absolute Polarity")
axes[2].set_xlabel("Absolute Polarity")
axes[2].set_ylabel("Count")
axes[2].grid(True, axis="y")

plt.tight_layout()
plt.show()

Each business came tagged with many categories, some of which overlap heavily (e.g., Restaurants vs. Food). To keep the insights digestible, I narrowed the analysis to six categories that are both common and distinct from one another. Below is a bar graph of the review count for each. Although "Restaurants" dominates with 35,093 reviews, the other categories are well represented too; the least frequent, "Hotels & Travel", has 1,601.

In [166]:
# Convert the 'categories' column into a list
combined["categories"] = combined["categories"].str.split(",")
In [167]:
# Explode the category column to get individual categories and count them
# Normalize whitespace and casing in each category name
combined["categories"] = combined["categories"].apply(
    lambda x: [category.strip().title() for category in x]
)

categories_to_filter = [
    "Restaurants",
    "Event Planning & Services",
    "Shopping",
    "Beauty & Spas",
    "Arts & Entertainment",
    "Hotels & Travel",
]
combined_filtered = combined[
    combined["categories"].apply(
        lambda x: any(cat in x for cat in categories_to_filter)
    )
]

combined_filtered_exploded = combined_filtered.explode("categories")

combined_filtered_exploded = combined_filtered_exploded[
    combined_filtered_exploded["categories"].isin(categories_to_filter)
].reset_index(drop=True)

category_order = combined_filtered_exploded["categories"].value_counts().index

# display(combined_filtered_exploded["categories"].value_counts())
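The filter-then-explode pattern above can be checked on a toy frame (names and categories made up). Note that a business tagged with two of the kept categories appears once per category after exploding, so the same review can be counted in multiple groups:

```python
import pandas as pd

# Toy frame with multi-category businesses, mirroring the real filtering
df = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "categories": [
            ["Restaurants", "Food"],
            ["Shopping"],
            ["Nightlife"],
        ],
    }
)
keep = ["Restaurants", "Shopping"]

# Keep rows containing at least one target category, then explode to
# one row per (business, category) pair and drop the other categories
filtered = df[df["categories"].apply(lambda cats: any(c in cats for c in keep))]
exploded = filtered.explode("categories")
exploded = exploded[exploded["categories"].isin(keep)].reset_index(drop=True)
```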
In [168]:
# Create barplot of business categories

sns.countplot(
    data=combined_filtered_exploded,
    y="categories",
    order=category_order,
    color="darkviolet",
    zorder=10,
)

plt.title("Distribution of Main Business Categories")
plt.ylabel("Category")
plt.xlabel("Count")
plt.grid(axis="x", zorder=0, lw=0.5)

Despite the star rating being the intended polarity of a review, sometimes the content of a review doesn’t fully match that. The below graphs explore the relationships between star rating and the polarity, subjectivity, and polarity strength of the reviews.
The overall trend between polarity and star rating is generally as expected: a higher star rating is associated with a higher polarity. It is interesting to note, however, that even for a star rating of 1, the average polarity is only slightly below 0, indicating only a vaguely negative tone.
The relationship between star rating and subjectivity is less pronounced, but also shows a positive relationship. A higher star rating correlates with more subjectivity. That isn’t the only pattern, though. The spread of subjectivities is greater for more extreme star ratings than it is for the middle star ratings. In other words, more extreme star ratings were associated with both higher and lower subjectivities while the middle star ratings tended to be more consistent.
Finally, the relationship between star rating and polarity strength (absolute polarity) is also positive. This is somewhat unexpected: it means lower-starred reviews tended to be more neutral in tone than higher-starred ones, when you might instead expect the middle ratings to be the most neutral. It is, however, consistent with the first graph, which showed the polarities of low-starred reviews centered near 0 and shifting upward from there. Additionally, the distribution of polarity strength is more concentrated for ratings of 1 and 2 and more spread out for higher ratings: polarity strength is consistent for low-starred reviews but variable for high-starred ones.
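The positive trend can also be checked numerically rather than visually, for example with a per-star median (the numbers here are toy values, chosen only to show the groupby shape):

```python
import pandas as pd

# Toy reviews: polarity loosely increasing with star rating
df = pd.DataFrame(
    {
        "stars_review": [1, 1, 2, 3, 3, 4, 5, 5],
        "Polarity": [-0.1, 0.05, 0.0, 0.2, 0.25, 0.35, 0.6, 0.5],
    }
)

# Median polarity per star rating; a monotone sequence supports the trend
median_by_star = df.groupby("stars_review")["Polarity"].median()
print(median_by_star)
```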

In [169]:
# Boxplots of Review Star Ratings with Sentiments

fig, axes = plt.subplots(1, 3, figsize=(21, 7))

# Plot for Abs_polarity
sns.boxplot(
    data=combined,
    x="stars_review",
    y="Abs_polarity",
    color="white",
    fliersize=1,
    linecolor="black",
    ax=axes[2],
    zorder=0,
)
sns.violinplot(
    data=combined,
    x="stars_review",
    y="Abs_polarity",
    color="darkviolet",
    ax=axes[2],
    alpha=0.25,
    inner=None,
    zorder=10,
)
axes[2].set_title("Absolute Polarity vs. Star Rating")
axes[2].set_xlabel("Star Rating")
axes[2].set_ylabel("Absolute Polarity")

# Plot for Polarity
sns.boxplot(
    data=combined,
    x="stars_review",
    y="Polarity",
    color="white",
    fliersize=1,
    linecolor="black",
    ax=axes[0],
    zorder=0,
)
sns.violinplot(
    data=combined,
    x="stars_review",
    y="Polarity",
    color="darkviolet",
    ax=axes[0],
    alpha=0.25,
    inner=None,
    zorder=10,
)
axes[0].set_title("Polarity vs. Star Rating")
axes[0].set_xlabel("Star Rating")

# Plot for Subjectivity
sns.boxplot(
    data=combined,
    x="stars_review",
    y="Subjectivity",
    color="white",
    fliersize=1,
    linecolor="black",
    ax=axes[1],
    zorder=0,
)
sns.violinplot(
    data=combined,
    x="stars_review",
    y="Subjectivity",
    color="darkviolet",
    ax=axes[1],
    alpha=0.25,
    inner=None,
    zorder=10,
)
axes[1].set_title("Subjectivity vs. Star Rating")
axes[1].set_xlabel("Star Rating")


plt.tight_layout()
plt.show()
In [170]:
# Boxplots of Business Categories with Sentiments

fig, axes = plt.subplots(1, 3, figsize=(21, 7))

# Plot for Abs_polarity
sns.boxplot(
    data=combined_filtered_exploded,
    y="categories",
    x="Abs_polarity",
    order=category_order,
    color="white",
    fliersize=1,
    linecolor="black",
    ax=axes[2],
    zorder=0,
)
sns.violinplot(
    data=combined_filtered_exploded,
    y="categories",
    x="Abs_polarity",
    order=category_order,
    color="darkviolet",
    ax=axes[2],
    alpha=0.25,
    inner=None,
    zorder=10,
)
axes[2].set_title("Absolute Polarity vs. Business Category")
axes[2].set_ylabel("Business Category")
axes[2].set_xlabel("Absolute Polarity")

# Plot for Polarity
sns.boxplot(
    data=combined_filtered_exploded,
    y="categories",
    x="Polarity",
    order=category_order,
    color="white",
    fliersize=1,
    linecolor="black",
    ax=axes[0],
    zorder=0,
)
sns.violinplot(
    data=combined_filtered_exploded,
    y="categories",
    x="Polarity",
    order=category_order,
    color="darkviolet",
    ax=axes[0],
    alpha=0.25,
    inner=None,
    zorder=10,
)
axes[0].set_title("Polarity vs. Business Category")
axes[0].set_ylabel("Business Category")

# Plot for Subjectivity
sns.boxplot(
    data=combined_filtered_exploded,
    y="categories",
    x="Subjectivity",
    order=category_order,
    color="white",
    fliersize=1,
    linecolor="black",
    ax=axes[1],
    zorder=0,
)
sns.violinplot(
    data=combined_filtered_exploded,
    y="categories",
    x="Subjectivity",
    order=category_order,
    color="darkviolet",
    ax=axes[1],
    alpha=0.25,
    inner=None,
    zorder=10,
)
axes[1].set_title("Subjectivity vs. Business Category")
axes[1].set_ylabel("Business Category")


plt.tight_layout()
plt.show()
In [171]:
# Test ANOVA Assumptions for Polarity

model1 = smf.ols(
    "Polarity ~ C(categories)", data=combined_filtered_exploded
).fit()
model2 = smf.ols(
    "Subjectivity ~ C(categories)", data=combined_filtered_exploded
).fit()
model3 = smf.ols(
    "Abs_polarity ~ C(categories)", data=combined_filtered_exploded
).fit()

residuals1 = model1.resid
fitted1 = model1.fittedvalues
residuals2 = model2.resid
fitted2 = model2.fittedvalues
residuals3 = model3.resid
fitted3 = model3.fittedvalues

fig, axes = plt.subplots(3, 2, figsize=(12, 12))

sns.residplot(
    x=fitted1,
    y=residuals1,
    lowess=True,
    line_kws={"color": "red"},
    ax=axes[0, 0],
)
axes[0, 0].set_xlabel("Fitted Values")
axes[0, 0].set_ylabel("Residuals")
axes[0, 0].set_title("Residuals vs. Fitted Values")

sm.qqplot(residuals1, line="s", ax=axes[0, 1])
axes[0, 1].set_title("QQ Plot of Residuals")

sns.residplot(
    x=fitted2,
    y=residuals2,
    lowess=True,
    line_kws={"color": "red"},
    ax=axes[1, 0],
)
axes[1, 0].set_xlabel("Fitted Values")
axes[1, 0].set_ylabel("Residuals")
axes[1, 0].set_title("Residuals vs. Fitted Values")

sm.qqplot(residuals2, line="s", ax=axes[1, 1])
axes[1, 1].set_title("QQ Plot of Residuals")

sns.residplot(
    x=fitted3,
    y=residuals3,
    lowess=True,
    line_kws={"color": "red"},
    ax=axes[2, 0],
)
axes[2, 0].set_xlabel("Fitted Values")
axes[2, 0].set_ylabel("Residuals")
axes[2, 0].set_title("Residuals vs. Fitted Values")

sm.qqplot(residuals3, line="s", ax=axes[2, 1])
axes[2, 1].set_title("QQ Plot of Residuals")

plt.suptitle("ANOVA Assumptions", fontsize=16)
fig.text(
    0.5, 0.925, "Polarity vs. Business Category", ha="center", fontsize=14
)
fig.text(
    0.5, 0.62, "Subjectivity vs. Business Category", ha="center", fontsize=14
)
fig.text(
    0.5,
    0.31,
    "Absolute Polarity vs. Business Category",
    ha="center",
    fontsize=14,
)

plt.tight_layout(h_pad=3.5)
fig.subplots_adjust(top=0.9)
plt.show()
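The residual and QQ plots above suggest the normality assumption behind a classical ANOVA is shaky, which is presumably why the next cells use the rank-based Kruskal-Wallis test with Dunn's post hoc instead. A minimal sketch of the test on made-up groups:

```python
import scipy.stats as stats

# Three toy groups with clearly different locations
g1 = [0.1, 0.2, 0.15, 0.3, 0.25]
g2 = [0.5, 0.6, 0.55, 0.45, 0.65]
g3 = [0.9, 0.85, 0.95, 0.8, 0.88]

# Kruskal-Wallis compares group ranks, so it needs no normality assumption
result = stats.kruskal(g1, g2, g3)
print(result.statistic, result.pvalue)
```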
In [172]:
# Kruskal-Wallis Test for polarity vs. business category

restaurants = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Restaurants"
]["Polarity"]

events = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Event Planning & Services"
]["Polarity"]

shopping = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Shopping"
]["Polarity"]

beauty = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Beauty & Spas"
]["Polarity"]

art = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Arts & Entertainment"
]["Polarity"]

travel = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Hotels & Travel"
]["Polarity"]

result = stats.kruskal(restaurants, events, shopping, beauty, art, travel)
dunn = sp.posthoc_dunn(
    combined_filtered_exploded,
    val_col="Polarity",
    group_col="categories",
    p_adjust="bonferroni",
)
print(
    "Kruskal Wallis & Dunn's Test Results for Polarity vs. Business Category"
)
print(f"Kruskal-Wallis Statistic: {result.statistic:.2f}")
print(f"P-Value: {result.pvalue:.2e}")
# display(dunn.round(4))
display(dunn < 0.05)
Kruskal Wallis & Dunn's Test Results for Polarity vs. Business Category
Kruskal-Wallis Statistic: 159.01
P-Value: 1.61e-32
Arts & Entertainment Beauty & Spas Event Planning & Services Hotels & Travel Restaurants Shopping
Arts & Entertainment False True True False True False
Beauty & Spas True False False True False True
Event Planning & Services True False False True False True
Hotels & Travel False True True False True False
Restaurants True False False True False True
Shopping False True True False True False
In [173]:
# Kruskal-Wallis Test for subjectivity vs. business category

restaurants = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Restaurants"
]["Subjectivity"]

events = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Event Planning & Services"
]["Subjectivity"]

shopping = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Shopping"
]["Subjectivity"]

beauty = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Beauty & Spas"
]["Subjectivity"]

art = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Arts & Entertainment"
]["Subjectivity"]

travel = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Hotels & Travel"
]["Subjectivity"]

result = stats.kruskal(restaurants, events, shopping, beauty, art, travel)
dunn = sp.posthoc_dunn(
    combined_filtered_exploded,
    val_col="Subjectivity",
    group_col="categories",
    p_adjust="bonferroni",
)
print(
    "Kruskal Wallis & Dunn's Test Results for Subjectivity vs. Business Category"
)
print(f"Kruskal-Wallis Statistic: {result.statistic:.2f}")
print(f"P-Value: {result.pvalue:.2e}")
# display(dunn.round(4))
display(dunn < 0.05)
Kruskal Wallis & Dunn's Test Results for Subjectivity vs. Business Category
Kruskal-Wallis Statistic: 362.05
P-Value: 4.46e-76
Arts & Entertainment Beauty & Spas Event Planning & Services Hotels & Travel Restaurants Shopping
Arts & Entertainment False False True False True False
Beauty & Spas False False True False True True
Event Planning & Services True True False True False True
Hotels & Travel False False True False True False
Restaurants True True False True False True
Shopping False True True False True False
In [174]:
# Kruskal-Wallis Test for absolute polarity vs. business category

restaurants = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Restaurants"
]["Abs_polarity"]

events = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Event Planning & Services"
]["Abs_polarity"]

shopping = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Shopping"
]["Abs_polarity"]

beauty = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Beauty & Spas"
]["Abs_polarity"]

art = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Arts & Entertainment"
]["Abs_polarity"]

travel = combined_filtered_exploded[
    combined_filtered_exploded["categories"] == "Hotels & Travel"
]["Abs_polarity"]

result = stats.kruskal(restaurants, events, shopping, beauty, art, travel)
dunn = sp.posthoc_dunn(
    combined_filtered_exploded,
    val_col="Abs_polarity",
    group_col="categories",
    p_adjust="bonferroni",
)
print(
    "Kruskal Wallis & Dunn's Test Results for Absolute Polarity vs. Business Category"
)
print(f"Kruskal-Wallis Statistic: {result.statistic:.2f}")
print(f"P-Value: {result.pvalue:.2e}")
# display(dunn.round(4))
display(dunn < 0.05)
Kruskal Wallis & Dunn's Test Results for Absolute Polarity vs. Business Category
Kruskal-Wallis Statistic: 170.14
P-Value: 6.80e-35
Arts & Entertainment Beauty & Spas Event Planning & Services Hotels & Travel Restaurants Shopping
Arts & Entertainment False True True False True False
Beauty & Spas True False False True False True
Event Planning & Services True False False True False True
Hotels & Travel False True True False True False
Restaurants True False False True False True
Shopping False True True False True False
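The three test cells above differ only in the metric column. One possible refactor (same statistics, less repetition; the DataFrame here is a made-up stand-in) loops over the metrics and builds the group lists with a groupby:

```python
import pandas as pd
import scipy.stats as stats

# Toy stand-in for combined_filtered_exploded
df = pd.DataFrame(
    {
        "categories": ["A"] * 5 + ["B"] * 5,
        "Polarity": [0.1, 0.2, 0.15, 0.25, 0.3, 0.6, 0.7, 0.65, 0.55, 0.75],
        "Subjectivity": [0.4, 0.5, 0.45, 0.55, 0.5, 0.6, 0.65, 0.7, 0.6, 0.75],
    }
)

# One Kruskal-Wallis test per metric, reusing the same grouping logic
results = {}
for metric in ["Polarity", "Subjectivity"]:
    groups = [g[metric].values for _, g in df.groupby("categories")]
    results[metric] = stats.kruskal(*groups)
```

The Dunn's post hoc call could be looped the same way, passing each metric name as `val_col` to `sp.posthoc_dunn`.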
In [ ]:
combined_filtered_exploded["stars_review"] = combined_filtered_exploded[
    "stars_review"
].astype(int)

combined_filtered_exploded["stars_business_round"] = np.floor(
    combined_filtered_exploded["stars_business"]
).astype(int)

# Order categories by their proportion of 5-star (rounded) business ratings
prop_business = (
    combined_filtered_exploded.groupby("categories")["stars_business_round"]
    .value_counts(normalize=True)
    .rename("proportion")
    .reset_index()
)
sorted_categories_business = (
    prop_business[prop_business["stars_business_round"] == 5]
    .sort_values("proportion", ascending=False)["categories"]
    .tolist()
)

# Order categories by their proportion of 5-star reviews
prop_review = (
    combined_filtered_exploded.groupby("categories")["stars_review"]
    .value_counts(normalize=True)
    .rename("proportion")
    .reset_index()
)
sorted_categories_review = (
    prop_review[prop_review["stars_review"] == 5]
    .sort_values("proportion", ascending=False)["categories"]
    .tolist()
)

combined_filtered_exploded["stars_review"] = pd.Categorical(
    combined_filtered_exploded["stars_review"]
)

combined_filtered_exploded["stars_business_round"] = pd.Categorical(
    combined_filtered_exploded["stars_business_round"]
)
In [178]:
# Create a figure and subplots
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Stacked Bar Chart of Business Categories with Business Star Ratings

combined_filtered_exploded["categories"] = pd.Categorical(
    combined_filtered_exploded["categories"],
    categories=sorted_categories_business,
    ordered=True,
)

sns.histplot(
    data=combined_filtered_exploded,
    y="categories",
    hue="stars_business_round",
    multiple="fill",
    stat="proportion",
    palette=sns.color_palette("hls", 5),
    hue_order=combined_filtered_exploded[
        "stars_business_round"
    ].cat.categories[::-1],
    discrete=True,
    shrink=0.8,
    ax=axes[0],
)
axes[0].set_title("Proportion of Business Star Ratings by Business Category")
axes[0].set_ylabel("Business Category")
axes[0].set_xlabel("Proportion")
sns.move_legend(
    axes[0],
    "upper center",
    bbox_to_anchor=(0.5, -0.15),
    ncol=5,
    title="Star Rating",
    reverse=True,
)

# Stacked Bar Chart of Business Categories with Review Star Ratings

combined_filtered_exploded["categories"] = pd.Categorical(
    combined_filtered_exploded["categories"],
    sorted_categories_review,
    ordered=True,
)

sns.histplot(
    data=combined_filtered_exploded,
    y="categories",
    hue="stars_review",
    multiple="fill",
    stat="proportion",
    palette=sns.color_palette("hls", 5),
    hue_order=combined_filtered_exploded["stars_review"].cat.categories[::-1],
    discrete=True,
    shrink=0.8,
    ax=axes[1],
)
axes[1].set_title("Proportion of Review Star Ratings by Business Category")
axes[1].set_ylabel("Business Category")
axes[1].set_xlabel("Proportion")
sns.move_legend(
    axes[1],
    "upper center",
    bbox_to_anchor=(0.5, -0.15),
    ncol=5,
    title="Star Rating",
    reverse=True,
)

# Adjust layout and show the combined plot
plt.tight_layout()
plt.show()
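The stacked proportions drawn by `histplot(multiple="fill")` can also be computed directly with `pd.crosstab` and row-wise normalization; a sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "categories": ["A", "A", "A", "B", "B", "B", "B"],
        "stars_review": [5, 5, 4, 3, 3, 5, 4],
    }
)

# Each row of proportions sums to 1, matching one stacked bar in the chart
props = pd.crosstab(df["categories"], df["stars_review"], normalize="index")
print(props)
```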

Note that the business star ratings were rounded down to whole stars for these charts and tests, and that 0.5 is added to every cell of the contingency tables below before running the chi-square tests, so that no cell count is zero.
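The 0.5 added to each contingency-table cell in the next cells is a Haldane-style smoothing step, which guards against zero counts in sparse cells. Its effect can be seen in isolation on a toy table (all data here are made up):

```python
import pandas as pd
import scipy.stats as stats

df = pd.DataFrame(
    {
        "stars": [5, 5, 4, 1, 5, 4, 4, 1, 1, 5],
        "category": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    }
)

table = pd.crosstab(df["stars"], df["category"])
table = table + 0.5  # Haldane-style smoothing: no expected count can be zero

chi2, p, dof, expected = stats.chi2_contingency(table)
print(round(chi2, 3), round(p, 3))
```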

In [179]:
combined_filtered_exploded["categories"] = pd.Categorical(
    combined_filtered_exploded["categories"],
    category_order,
    ordered=True,
)
In [180]:
# Chi-square
contingency_table = pd.crosstab(
    combined_filtered_exploded["stars_business_round"],
    combined_filtered_exploded["categories"],
)

contingency_table += 0.5

# Overall Chi-square
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print("Chi-Square Tests for Business Star Ratings by Category")
print("Chi-Square Statistic:", chi2.round(2))
print("P-Value:", p)

# Perform Chi-Square test for each pair of categories
categories = contingency_table.columns
p_values = []
comparisons = []
for cat1, cat2 in combinations(categories, 2):
    sub_table = contingency_table[[cat1, cat2]]
    chi2, p, _, _ = stats.chi2_contingency(sub_table)
    p_values.append(p)
    comparisons.append(f"{cat1} vs {cat2}")
# Adjust p-values using Bonferroni correction
_, p_adjusted, _, _ = smm.multipletests(p_values, method="bonferroni")
# Create a DataFrame for the results
post_hoc_results = pd.DataFrame(
    {
        "Comparison": comparisons,
        "Adjusted P-Value": p_adjusted,
    }
)
# Add significance column
post_hoc_results["Significant"] = post_hoc_results["Adjusted P-Value"] < 0.05
post_hoc_results = post_hoc_results.set_index("Comparison")
post_hoc_results.index.name = None

display(post_hoc_results)
Chi-Square Tests for Business Star Ratings by Category
Chi-Square Statistic: 5772.58
P-Value: 0.0
Adjusted P-Value Significant
Restaurants vs Event Planning & Services 0.000000e+00 True
Restaurants vs Shopping 0.000000e+00 True
Restaurants vs Beauty & Spas 0.000000e+00 True
Restaurants vs Arts & Entertainment 4.753070e-59 True
Restaurants vs Hotels & Travel 0.000000e+00 True
Event Planning & Services vs Shopping 1.839501e-93 True
Event Planning & Services vs Beauty & Spas 6.533687e-41 True
Event Planning & Services vs Arts & Entertainment 1.876549e-87 True
Event Planning & Services vs Hotels & Travel 8.423155e-54 True
Shopping vs Beauty & Spas 1.556177e-28 True
Shopping vs Arts & Entertainment 4.248370e-57 True
Shopping vs Hotels & Travel 2.592184e-31 True
Beauty & Spas vs Arts & Entertainment 3.539321e-76 True
Beauty & Spas vs Hotels & Travel 1.637500e-24 True
Arts & Entertainment vs Hotels & Travel 5.797439e-99 True
In [181]:
# Chi-square
contingency_table = pd.crosstab(
    combined_filtered_exploded["stars_review"],
    combined_filtered_exploded["categories"],
)

contingency_table += 0.5

# Overall Chi-square
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print("Chi-Square Tests for Review Star Ratings by Category")
print("Chi-Square Statistic:", chi2.round(2))
print("P-Value:", p)

# Perform Chi-Square test for each pair of categories
categories = contingency_table.columns
p_values = []
comparisons = []
for cat1, cat2 in combinations(categories, 2):
    sub_table = contingency_table[[cat1, cat2]]
    chi2, p, _, _ = stats.chi2_contingency(sub_table)
    p_values.append(p)
    comparisons.append(f"{cat1} vs {cat2}")
# Adjust p-values using Bonferroni correction
_, p_adjusted, _, _ = smm.multipletests(p_values, method="bonferroni")
# Create a DataFrame for the results
post_hoc_results = pd.DataFrame(
    {
        "Comparison": comparisons,
        "Adjusted P-Value": p_adjusted,
    }
)
# Add significance column
post_hoc_results["Significant"] = post_hoc_results["Adjusted P-Value"] < 0.05
post_hoc_results = post_hoc_results.set_index("Comparison")
post_hoc_results.index.name = None

display(post_hoc_results)
Chi-Square Tests for Review Star Ratings by Category
Chi-Square Statistic: 1322.49
P-Value: 4.5204657190659196e-268
                                                   Adjusted P-Value  Significant
Restaurants vs Event Planning & Services               1.026970e-38         True
Restaurants vs Shopping                                1.665348e-88         True
Restaurants vs Beauty & Spas                          1.160864e-140         True
Restaurants vs Arts & Entertainment                    8.086487e-01        False
Restaurants vs Hotels & Travel                         3.864245e-45         True
Event Planning & Services vs Shopping                  1.246295e-23         True
Event Planning & Services vs Beauty & Spas             1.014245e-40         True
Event Planning & Services vs Arts & Entertainment      2.167515e-16         True
Event Planning & Services vs Hotels & Travel           1.708936e-16         True
Shopping vs Beauty & Spas                              8.806005e-23         True
Shopping vs Arts & Entertainment                       3.255638e-32         True
Shopping vs Hotels & Travel                            4.090113e-02         True
Beauty & Spas vs Arts & Entertainment                  1.097079e-77         True
Beauty & Spas vs Hotels & Travel                       5.170178e-12         True
Arts & Entertainment vs Hotels & Travel                4.089392e-29         True
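Under the hood, `stats.chi2_contingency` computes the statistic as the sum over cells of (observed − expected)² / expected, with expected counts derived from the row and column totals. A minimal pure-Python check on a small made-up 2×2 table (note that for 2×2 tables scipy additionally applies Yates' continuity correction unless `correction=False`, so its value would differ slightly from this uncorrected sum):

```python
# Chi-square statistic by hand for a small made-up table of counts:
# rows = star-rating buckets, columns = two hypothetical categories.
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]        # [40, 60]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 50]
grand_total = sum(row_totals)                      # 100

chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence of rows and columns
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (obs - expected) ** 2 / expected

print(round(chi2_stat, 2))  # 16.67
```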
In [182]:
combined["date"] = pd.to_datetime(combined["date"], errors="coerce")
combined_filtered_exploded["date"] = pd.to_datetime(
    combined_filtered_exploded["date"], errors="coerce"
)
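The `errors="coerce"` flag matters here: any timestamp pandas cannot parse becomes `NaT` instead of raising, so a handful of malformed dates won't abort the conversion of the whole column. A small sketch of the behavior:

```python
import pandas as pd

# errors="coerce" converts unparseable values to NaT instead of raising,
# so a few malformed dates don't break the conversion of the whole column.
s = pd.Series(["2019-05-04 18:02:11", "not a date", None])
parsed = pd.to_datetime(s, errors="coerce")

print(parsed.isna().tolist())  # [False, True, True]
```

The trade-off is that coerced rows silently become missing values, so it can be worth checking `parsed.isna().sum()` afterward.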
In [39]:
# combined_shorter = combined_filtered_exploded[
#     (combined_filtered_exploded["date"] > (pd.to_datetime('2020-01-01 00:00:00') - pd.DateOffset(years=2))) &
#     (combined_filtered_exploded["date"] < pd.to_datetime('2020-01-01 00:00:00'))
# ]

# combined_shorter = combined_filtered_exploded[
#     (
#         combined_filtered_exploded["date"]
#         > (pd.to_datetime("2020-01-01 00:00:00") - pd.DateOffset(years=5))
#     )
# & (
#     combined_filtered_exploded["date"]
#     < pd.to_datetime("2020-01-01 00:00:00")
# )
# ]

combined_filtered_exploded["stars_review"] = pd.to_numeric(
    combined_filtered_exploded["stars_review"]
)
In [40]:
fig = px.scatter(
    combined_filtered_exploded,
    x="date",
    y="stars_review",
    trendline="lowess",
    title="Review Star Ratings Over Time",
    color="categories",
    category_orders={"categories": category_order},
)

# Loop through fig.data and adjust the legend visibility
for trace in fig.data:
    if trace.mode == "markers":  # Scatter points
        trace.showlegend = False  # Do not show in the legend
    elif trace.mode == "lines":  # Trendline (LOWESS)
        trace.showlegend = True  # Show the trendline in the legend

# px.scatter with a trendline interleaves marker and line traces per category;
# keep only the LOWESS trendline traces (every second trace)
fig.data = fig.data[1::2]

fig.update_layout(
    xaxis_title="Date",
    yaxis_title="Star Ratings",
    font_family="Times New Roman",
    font_color="black",
    legend_title="Category",
    showlegend=True,
    legend=dict(
        x=1.35,  # Horizontal position of the legend
        y=0.5,  # Vertical position of the legend
        xanchor="right",  # Anchor the legend horizontally to the right
        yanchor="middle",  # Anchor the legend vertically to the center
    ),
)


fig.show(renderer="notebook")
In [ ]:
fig1 = px.scatter(
    combined_filtered_exploded,
    x="date",
    y="Polarity",
    trendline="lowess",
    title="Review Polarity Over Time",
    color="categories",
    category_orders={"categories": category_order},
)

# Loop through fig.data and adjust the legend visibility
for trace in fig1.data:
    if trace.mode == "markers":  # Scatter points
        trace.showlegend = False  # Do not show in the legend
    elif trace.mode == "lines":  # Trendline (LOWESS)
        trace.showlegend = True  # Show the trendline in the legend

# Keep only the LOWESS trendline traces (every second trace)
fig1.data = fig1.data[1::2]

fig1.update_layout(
    xaxis_title="Date",
    yaxis_title="Polarity",
    font_family="Times New Roman",
    font_color="black",
    legend_title="Category",
    showlegend=True,
    legend=dict(
        x=1.35,  # Horizontal position of the legend
        y=0,  # Vertical position of the legend
        xanchor="right",  # Anchor the legend horizontally to the right
        yanchor="bottom",  # Anchor the legend vertically to the bottom
    ),
)

fig2 = px.scatter(
    combined_filtered_exploded,
    x="date",
    y="Subjectivity",
    trendline="lowess",
    title="Review Subjectivity Over Time",
    color="categories",
    category_orders={"categories": category_order},
)

fig2.data = fig2.data[1::2]

fig2.update_layout(
    xaxis_title="Date",
    yaxis_title="Subjectivity",
    font_family="Times New Roman",
    font_color="black",
    legend_title="Category",
)

fig3 = px.scatter(
    combined_filtered_exploded,
    x="date",
    y="Abs_polarity",
    trendline="lowess",
    title="Review Absolute Polarity Over Time",
    color="categories",
    category_orders={"categories": category_order},
)

fig3.data = fig3.data[1::2]

fig3.update_layout(
    xaxis_title="Date",
    yaxis_title="Absolute Polarity",
    font_family="Times New Roman",
    font_color="black",
)

# Create a 2x2 grid of subplots (three panels are used)
fig = make_subplots(
    rows=2,
    cols=2,
    subplot_titles=(
        "Review Polarity Over Time",
        "Review Subjectivity Over Time",
        "Review Absolute Polarity Over Time",
    ),  # Titles for each subplot
)

# Add the traces from fig1 to the first subplot (row 1, column 1)
for trace in fig1.data:
    fig.add_trace(trace, row=1, col=1)

# Add the traces from fig2 to the second subplot (row 1, column 2)
for trace in fig2.data:
    fig.add_trace(trace, row=1, col=2)
# Add the traces from fig3 to the third subplot (row 2, column 1)
for trace in fig3.data:
    fig.add_trace(trace, row=2, col=1)

fig.show(renderer="notebook")
In [ ]: